INTRODUCTION

The generation of new chemical leads for biological targets
is a very challenging task. There are several strategies for
finding new leads, including quasi-random biological screening,
which has played an important role in drug discovery
for many years. It would be preferable, in our view, to
incorporate as much explicit design as possible into the lead
generation process, preferably such that a better understanding
of structure-activity relationships can be gained. The
design of focused and diverse screening sets using a priori
hypotheses will give such an insight. In this paper, we
describe a novel molecular descriptor, the Diverse Property-Derived
(DPD) code, that is designed to contain information
about key molecular and physicochemical properties of a
molecule. We will discuss its application to the selection
of a representative screening set, the selection of secondary
screening sets to obtain more information concerning the
SAR of a particular target receptor, and the profiling of
combinatorial libraries. The usefulness of molecular and
physicochemical descriptors, such as the DPD code, is
discussed critically.

The original goals of our studies into molecular similarity
were to provide a rational framework for selecting representative
sets of compounds for biological screening and to
provide a mechanism for selecting further compounds to
follow up initial leads. Corporate databases of chemical
compounds contain a wealth of information and provide a
very rich source of compounds for screening. At the genesis
of the project in 1991, the large size of our corporate
databases (RPR > 150 000, Rp Agrochemicals > ^0 000
compounds) precluded the systematic screening of all
compounds. Even today, the brute-force method of high-throughput
screening is still relatively expensive to perform
in terms of designing and validating assays and providing
protein, and it may not be possible to screen all possible
compounds for every screen. An alternative strategy is to
abstract a small subset of a large database so that the subset
represents as many as possible of the key features of that
database in an economical and nonredundant fashion. This
requires the a priori formulation of hypotheses, by the team
selecting the sets, concerning what constitutes a key feature
and how that feature should be measured. The same
arguments apply to the design of combinatorial libraries. It
is our experience that the needs of different medicinal
chemistry projects will emphasize different features; the
procedure described in this paper attempts to provide a
framework that avoids such biases.

Other workers in this field have classified molecules on
the basis of the functional groups the molecules contain.
We have also performed this type of structural (chemical
family) classification;1 however, we feel that this type of
analysis is unsatisfying from the perspective of understanding
ligand-receptor interactions. A receptor or enzyme does not
recognize particular atoms or groups; it interacts with the
properties in space (electrostatic- and orbital-based) projected
by a certain geometric arrangement of these atoms. Functional
group classification ignores the possibility of bioisosterism
and gives little idea of how similar two groups
are in terms of their receptor binding properties. A similarity
metric based on molecular properties was developed in an
attempt to provide a general measure of molecular similarity
based around this view of ligand-receptor interactions and
to answer some of the objections to functional group
classification. Molecular properties do not describe specific
geometries of interaction, and they smear conformational
space into a single lump; in this sense, molecular properties
are not the absolute answer to molecular similarity, but they
are interesting and useful descriptors nonetheless. As part
of this work, several novel molecular descriptors were
developed to represent the electronic and steric properties
of a molecule that were statistically uncorrelated to other
descriptors. Forty-nine molecular descriptors (including
standard physicochemical descriptors) were considered; a
subset of descriptors was selected based on a statistical
analysis and on our biases about ligand-receptor interactions.
The validation of the method of subset selection was provided
by examining several independent databases: two publicly
available databases, Pharma Projects (PP; 6000 compounds
in 1991) and the Standard Drug File (SDF; 25 000 compounds),
and three corporate databases maintained
independently, derived historically from the research centers
of the company: RPR_1 (UK; > 60 000 compounds),
RPR_2 (France; > 70 000 compounds), and RPR_3 (USA;
> 18 000 compounds). As our goal was a small set "representative"
of the diversity of a database, rather than of the
actual proportion of compounds, a partitioning approach was
followed; a complementary approach involving clustering
of "fingerprints" of substructural information is the subject
of another paper.

This paper describes the construction of a similarity
measure based on molecular descriptors and properties and
issues concerning the selection of rational sets and design
of combinatorial libraries and gives brief details of the
programs that have been developed during this work.
Future applications and developments are also discussed.

A general screening set selected from the RPR_1 database,
containing about 350 compounds, was prepared and assayed
in several screens; the final set of the desired size of ~1000
compounds contained compounds from all three segments
of the database. The molecules in this set were chosen to
provide the widest possible diversity of compounds for test,
given our model of molecular similarity, covering the whole
range of suitable structures that occur in the database. This
collection has provided a number of hits in biological assays;
one of these hits will be discussed critically to try to
understand what role this type of similarity measure can play
in future diversity studies.



1. METHODS 

The structure of a compound can be represented by a
SMILES code, which contains chemically coherent information
about the atom types and bonding patterns in the
molecule. Many molecular descriptors and properties can
be estimated from this atom and bond information, for
instance, hydrophobicity (octanol/water partition coefficient),
flexibility, charge distribution, and molecular volume. Much
more information can be calculated using the 3D structures
generated from these codes; this topic is discussed in another
paper in this series. Software has been written to process
automatically large numbers of these SMILES codes (derived
from structural databases) into a master file. A flowchart
outlining the logic of the method is given in Figure 1. Each
line of a master file contains 49 molecular descriptors and
46 functional group counts for each compound. A statistical
and literature analysis of the descriptors was performed, and
a subset of six key descriptors was chosen. To generate a
[...] from a list of eight randomly chosen candidates [...]
partition from each of the three segments of the database
(RPR_1, RPR_2, and RPR_3); the actual selection was made
[...] by an experienced medicinal chemist [...] case where
[...] further choices are provided. [...] the bin number, [...]
database) in a particular partition can be located and used to
follow up initial leads. We now describe each part of the
process in more detail.
1.1. Generation of Molecular Descriptors. The molecular
descriptors were generated from three separate
commercial software packages: atom and group counts were
produced using GENIE,8 electronic and flexibility indices
were calculated by MOLCONN-X, and ClogP and CMR
using Daylight v3.53. These programs are described in
more detail below. All other programs were written and
developed in house. The data are combined into a single
master file by the MAKE_MASTER program, and the DPD
for each molecule is computed by the PARTITION
program. A brief listing of the descriptors computed is given
in Chart 1. Further information is available in ref 9. These
programs are controlled by a single batch control program,
MAKE_DPD, to make the production of DPD sets as
straightforward, error-free, and efficient as possible.

All initial studies were performed under VAX VMS, using 
DCL scripts to control job submission to the VAX cluster. 
As this platform is no longer supported by Daylight, all 
current work is now performed on an SGI cluster. The
implications of this change will be discussed.

Chart 1. Listing of the Descriptors Computed

Variable
Registry Name
DPD code
Quality Flag
#aromatic rings
#h-acceptors
#h-donors
#rotatable bonds
molecular volume
#heavy atoms
formula weight
flexibility index
Electrotopological index (with squares)
Normalized Electrotopological index
Electrotopological index (with modulos)
Normalized Electrotopological index
Kappa 0
Kappa 1
Kappa 2
Kappa 3
Kappa 4
Kappa 5
Kappa 6
Shannon index
Total topological index
#paths
Sum intrinsic I values
Sum delta I values
Total electrotopological state index
#bonds
#elements
Idw Bonchev-Trinajstic information indices
Average Idw
Idc Bonchev-Trinajstic information indices
Average Idc
Terminal group
Terminal group 3
Terminal methyl
Terminal methyl 3
Wiener number
Wiener p
Platt f
total Wiener
KnotP
KnotVP
#N atoms
#O atoms
#S atoms
Andrews' binding constant
ClogP
ClogP Error
CMR
CMR Error
#basic N
#aniline NH
#aromatic NH
#amide NH
#acid
#alcohol
#thioalcohol
#urea
#thiourea
#amide
#thioamide
#ketone
#thioketone
#aldehyde
#thioaldehyde
#ester
#ether
#thioether
#nitro
#aromaticN:
#sulfoxide
#sulfone
#sulfonamide
#nitrile
#aromaticF
#N-oxide
#N-(O,N)H
#N-(O,N):
#hydrazineH
#hydrazine:
#tetrazole
#sulfonic acid
#phosphor acid
#carbamic acid
#acid sulfonamide
#acidic amide
#acidic groups
#toxic groups
#reactive halides
#reactive epoxides
#epoxide
#sulfonates
#anhydride
#NCS,NCO,COjH
#organo-P
#C+, N+
#cations

1.1.1. MOLCONN-X. The MOLCONN-X software,
from Kier and Hall,4 computes a wide range of topological
indices for a molecular structure. These indices have been
widely and successfully used in the development of QSAR
and QSPR equations, demonstrating their power as molecular
descriptors. In addition to the κ molecular shape
indices, χ molecular connectivity indices, topological state
indices, and atomic electrotopological state indices, other
composite indices considered to be of potential use were also
calculated. These included the flexibility index (the product
of the κα1 and κα2 shape indices divided by the number of
vertices) and the molecular normalized electrotopological
indices used in this work. We devised the normalized
molecular electrotopological indices to combine information
about the electronic interactions and the topological environment
of each atom into a single overall molecular value.
Inspection of the results for a diverse sampling of compounds
and functional groups indicated that the values calculated
gave a sensible classification of "polarity" (from a medicinal
chemistry perception) and were superior to an index calculated
from atomic partial charges. The standard valence state
electronegativity index (including perturbations from neighboring
atoms in the molecule) is computed for each atom in
a molecule. The normalized index is the sum of the squares
of the atomic indices, divided by the number of atoms. We
also devised a descriptor that we have called the aromatic
density descriptor; this was developed as other potentially
useful indices we had calculated were too correlated with
other indices of interest (see below). Aromatic density is
simply the number of aromatic rings in a molecule divided
by the molar volume (computed by Schroeder's method13).
We believe that this descriptor represents, in a very approximate
way, the ability of the molecule to form aromatic
interactions with the receptor, and some aspects of the shape
of a molecule.
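
The two composite indices defined in Section 1.1.1 reduce to simple arithmetic once the per-atom indices, the ring count, and the molar volume are available. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the per-atom electrotopological values and the molar volume are taken as precomputed inputs (in the original work they come from MOLCONN-X and Schroeder's method), and no unit scaling is applied.

```python
def normalized_electrotopological_index(atomic_indices):
    """Sum of the squares of the per-atom indices divided by the atom count.

    atomic_indices: precomputed per-atom values (assumed input; the text
    obtains them from MOLCONN-X).
    """
    if not atomic_indices:
        raise ValueError("at least one atomic index is required")
    return sum(v * v for v in atomic_indices) / len(atomic_indices)


def aromatic_density(n_aromatic_rings, molar_volume):
    """Number of aromatic rings divided by the molar volume.

    molar_volume: precomputed value (assumed input); any scaling factor
    used in the original programs is not stated and is omitted here.
    """
    if molar_volume <= 0:
        raise ValueError("molar volume must be positive")
    return n_aromatic_rings / molar_volume


# Illustrative (made-up) inputs, not values from the paper.
example_index = normalized_electrotopological_index([1.0, -2.0, 3.0])
example_density = aromatic_density(2, 200.0)
```

Squaring the atomic indices before summing keeps positive and negative contributions from cancelling, which is presumably why the squared form was preferred for a "polarity" measure.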
1.1.2. ClogP. The BDRIVE module of the Daylight
v3.54 software running under VAX VMS was originally used
for the calculation of ClogP (calculated log P from the
CLOGP3 algorithm) and CMR (calculated molar refractivity)
values. The current implementation now uses v4.41 from
Daylight running under SGI IRIX 5.2. This switch is not
without its effects. The values of ClogP computed by the
two versions were compared for a set of 10 000 compounds.
The results were that the ClogP values for 609 compounds
changed by 5-10% and those for 2255 compounds changed by
more than 10%, of which 1077 were different by 0.2-0.5 log
units and 1140 by more than 0.5 log units. This is consistent with
the reparameterization of fragments that occurs with each
new release,14 but it does imply that, to make meaningful
comparisons between databases, ClogP values of all databases
should be derived using the same version of the
program. The mean and spread of the values did not seem
to be affected.
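
A version-to-version comparison of this kind can be sketched as a simple bucketing of per-compound differences. This is an illustration only: the function name, the dictionary layout, and the handling of compounds whose old value is zero (skipped, because a percentage change is undefined) are assumptions, not details taken from the original programs.

```python
def summarize_changes(old, new):
    """Bucket compounds by how far a computed value moved between versions.

    old, new: dicts mapping a compound ID to the value from each version.
    The log-unit buckets are counted only among compounds that changed by
    more than 10%, mirroring the "of which" wording in the text.
    """
    summary = {
        "changed_5_to_10_pct": 0,
        "changed_over_10_pct": 0,
        "over_10_pct_and_0.2_to_0.5_log": 0,
        "over_10_pct_and_over_0.5_log": 0,
    }
    for cid in old.keys() & new.keys():
        v_old, v_new = old[cid], new[cid]
        if v_old == 0:
            continue  # percentage change undefined (assumed handling)
        diff = abs(v_new - v_old)
        pct = 100.0 * diff / abs(v_old)
        if 5.0 <= pct <= 10.0:
            summary["changed_5_to_10_pct"] += 1
        elif pct > 10.0:
            summary["changed_over_10_pct"] += 1
            if 0.2 <= diff <= 0.5:
                summary["over_10_pct_and_0.2_to_0.5_log"] += 1
            elif diff > 0.5:
                summary["over_10_pct_and_over_0.5_log"] += 1
    return summary


# Made-up values for four hypothetical compounds.
old_values = {"a": 2.00, "b": 1.00, "c": 3.00, "d": 4.00}
new_values = {"a": 2.15, "b": 1.30, "c": 3.90, "d": 4.00}
result = summarize_changes(old_values, new_values)
```

Running such a summary whenever the underlying software is upgraded gives a quick indication of whether previously assigned descriptor ranges remain comparable.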

1.1.3. GENIE. A GENIE routine (Daylight software
running under VAX/VMS) was written to derive various
fragmental properties from the SMILES codes, using interpreted
SMARTS substructure queries. These included the
following: number of hydrogen-bond acceptor groups;
number of hydrogen-bond donor groups; number of aromatic
rings; number of flexible bonds; and molecular volume. In
addition, frequency counts of 46 functional groups were
computed. Although these are not used in the work
described in this paper, the extra information was generated
for future use in alternative measures of molecular similarity.
The switch to SGI IRIX again has caused difficulties, as the
GENIE program has not been implemented on this platform.
A basic GCL parser15 was therefore written in C, using the
Daylight toolkits, to replicate the functions of GENIE; to
date, our program can interpret standard SMARTS and
compound SMARTS statements, additions, conditional statements,
simple multiplications, and print formatting statements.
As this was sufficient to our needs, other functions
were not implemented. The parser has been useful in several
other roles, including filtering for toxicity, and lately in
selection of reagents for combinatorial libraries.

1.1.4. Filtering of the Data. Molecules in the database
can be subjected to several optional layers of filtering, based
on chemical formula, molecular weight, and charge. We
made the explicit assumption that we would exclude from
our analysis all compounds that were chemically unsuitable
for general screening. The reasoning was that the inclusion
of these compounds could add noise to the analysis, as we
were only interested in compounds that were reasonable for
further study by medicinal chemists. We have applied this
principle both in the selection of rational sets and in profiling
to aid the design of combinatorial libraries. As for the
GENIE program, the FILTER program is written as a series
of substructural queries, interpreted by our GCL parser. The
FILTER program can be used to flag molecules that are
reactive and may bind nonspecifically to proteinaceous
material (e.g., acid halides), that are cytotoxic, or that exhibit
a wide range of potent biological activities and so are not suitable
for general screening (e.g., prostanoids). The filtering criteria
are given in Table 1a. Flagged molecules were removed
from the DPD analysis.

Table 1. The Chemical Filtering Criteria Used in the Form of
Substructure Queries That Remove Unwanted Structures from the
Database and To Create the DPD Databases and the [...]

a. In the Form of the Substructural Queries

1.2. Utility Programs. The MAKE_MASTER program
combines all the sources of data into one file; the PARTITION
program analyses each line of the master file,
computes the DPD code, and inserts it back into the master
file. Further filtering mechanisms were incorporated into
these programs: trivial compounds with low molecular
weight or small numbers of atoms and very large molecules
can be flagged, as can compounds for which some of the
molecular descriptors cannot be accurately computed. This
normally indicated a problem in the original SMILES code
that can then be corrected. One error that occurs more than
others is that the number of marked ring bonds exceeds 10;
as our version of MOLCONN-X could not interpret the %N
syntax, an error is produced. It should be stated that, where
possible, molecular descriptors and DPD codes are produced
for all molecules in the database. The filters were applied
for the statistical analysis of the molecular descriptors and,
in our view, should also be applied to the selection of rational
sets for general screening; they are also useful indicators in
the profiling and design of combinatorial libraries, particularly
those intended for general screening.

The MAKE_DPD program takes as input a file of SMILES
codes and generates a master file and, if required, a
partitioning of the data. There are several options to select
the levels of filtering that are used on the data, depending
on the application. The program was designed to take
advantage of the implicit parallel computing facilities in a
VAX cluster, where the tasks can be run in parallel on
different machines. A similar facility was written for our
SGI cluster, but the queuing and job scheduling facilities
are not as well implemented.

The DATA_SEARCH program is a data-mining utility that
can take as input a list of registry numbers, a DPD code, or
a single compound ID and create files containing DPD data
or lists of similar compounds contained in the corporate
database in various formats. It can also perform several other
functions that are beyond the scope of this paper. The output
can be used to create local databases in any system that can
import SDF files.

1.3. Statistical Analysis of the Molecular Descriptors.
The filtered data in the RPR_1 master file were used to
determine which molecular descriptors should be used to
classify the database; this, in SAR terms, was our training
set. The filtering criteria are given in Tables 1a,b. We
decided to select a set of descriptors that could measure
hydrophobicity, polarity, flexibility, shape, hydrogen-bonding
properties, and aromatic interactions, reflecting our biases
about ligand-receptor interactions. We chose to work in
chemical space rather than looking at biological activity
space, because our goal was to derive an assay-independent
metric. We looked at the literature which gave the original
derivation and use of each descriptor where possible and
eliminated most of them because their primary function was
not to describe hydrophobicity, polarity, flexibility, shape,
hydrogen-bonding, or aromaticity. For example, the Shannon
index is useful for describing molecular symmetry,17 but we
felt that it would not be useful as a generalized descriptor
of shape. Preliminary statistical analyses showed that many
potentially relevant descriptors were highly correlated; this
led us to develop the aromatic density index, which combined
potentially relevant information into a reasonably noncorrelated
descriptor. As with the functional group counts, we
still calculate all the descriptors, as they may be useful in
other similarity metrics, based on different assumptions. A
final statistical analysis of the 17 molecular descriptors
(#rotatable bonds, #h-acceptors, #h-donors, molecular volume,
flexibility index, electrotopological index (with squares),
normalized electrotopological index, electrotopological index
(with modulos), normalized electrotopological index, #paths,
κ2, total topological index, sum intrinsic I values, total
electrotopological state index, Andrews' binding constant,
ClogP, CMR) remaining after this literature analysis was
performed using RS/1 for each compound in a database
of 42 700 molecules derived from the RPR_1 collection (the
remainder having been removed by the various filters); six
descriptors stood out clearly as being only weakly correlated.
The other descriptors (for example, CMR) showed much
higher correlations. The final correlation table is shown
below (Table 2a). The largest magnitude correlation between
any pair of descriptors was 0.5, between ClogP and the
flexibility index. This is perhaps understandable, if we
assume that flexible bonds will be mostly composed of
saturated groups. As an aside, it has been pointed out that
the ClogP values for flexible compounds are themselves
likely to be overestimates, as the extra groups are still treated
additively, not allowing for the possibility of internal collapse
to bury hydrophobic surface. There is also a correlation of
-0.4 between aromaticity and flexibility, which again is
intuitive. The magnitudes of the other correlation coefficients
were of the order of 0.25, indicating that there is
not much pairwise correlation between the descriptors. This
is important to the partitioning strategy. Pairwise uncorrelated
descriptors pass the first test for orthogonality, so the
data are more likely to be distributed evenly across the
descriptor space (the descriptors could be related by multiple
collinearities, which would invalidate our remarks about
orthogonality; we did not investigate this possibility). We
made the assumption that our descriptors were orthogonal
and hence that partitioning would be able to divide space
evenly and so produce a representative sampling.
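
The pairwise screening described above amounts to computing a Pearson correlation matrix over the descriptor columns and inspecting the largest off-diagonal magnitude. The sketch below is a plain-Python illustration with hypothetical function names and made-up data; the original analysis was carried out in RS/1, whose procedures are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def largest_pairwise_correlation(columns):
    """Return ((name_a, name_b), r) for the pair with the largest |r|.

    columns: dict mapping a descriptor name to its list of values.
    """
    names = list(columns)
    best_pair, best_r = None, 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if best_pair is None or abs(r) > abs(best_r):
                best_pair, best_r = (a, b), r
    return best_pair, best_r


# Made-up descriptor columns for four hypothetical compounds.
data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [1.0, -1.0, -1.0, 1.0],
}
pair, r = largest_pairwise_correlation(data)
```

As the text itself cautions, low pairwise correlation does not rule out multiple collinearity, so a check of this kind is only a first test for orthogonality.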

Table 2. The Correlation Matrix for the Molecular Descriptors
Derived from the RPR_1, RPR_2, RPR_3, PP, and SDF Databases

(Parts a-e: pairwise correlations, for each database in turn, among
H-acceptor, H-donor, flexibility, electrotopological, ClogP, and
aromatic density.)

Correlation matrices are not the only method for selecting
key descriptors; we decided against techniques such as
principal component analysis because we wanted to maintain
the essential simplicity of the descriptors. Another method
would have been to look for the spanning set of descriptors;
if the set of the six chosen descriptors can explain (using
principal component analysis) or predict (using
partial least squares) the variance in the other descriptors,
then they are a spanning set, and the information contained
in the other descriptors can be said to be largely redundant.
Our six descriptors are probably not a spanning set, as our
motivation in choosing them was not only statistical but also
chemical. Correlation matrices were obtained from the other
databases examined (RPR_2, RPR_3, PP, SDF; Table 2b-e)
using the same protocol. As may be seen from the tables,
the patterns of correlation values are broadly similar. As
the databases are independent sources of data and have been
treated separately, we believe that these correlations are a
general phenomenon and provide a justification for the
unsophisticated statistical approach followed. It should be
noted that the SDF database has some higher correlations
(five values with magnitude > 0.5), the largest being between
the numbers of H-bond acceptors and donors (0.63), whereas the
PP database, which should contain similar compounds (both
being databases of available drugs), has lower correlations.
We know that the SDF database also contains dentifrices,
spermicides, and disinfectants, which we speculate might be
the cause of these observations, but we have no hard evidence
for this position. Frequency histograms of each of the six
descriptors plotted for each of the five databases show a
remarkably similar profile across the databases, again
indicating that the properties of the descriptors are not tied
to the origin of the medicinal chemistry compound databases
(Figure 2a-f). However, we would be wary of extending
these conclusions to other types of nonmedicinal chemistry
databases (for example, the CAS or Available Chemicals
Directory databases) without performing a test analysis first,
simply because we have little knowledge of the nature and
behavior of the data from these sources. We also note that
statistical comparisons of the data (both means tests for
normal distributions and χ² tests were performed) showed
that all the distributions were significantly different at the p
< 0.005 level. We are therefore committing a type I error
(rejecting a hypothesis when statistical tests suggest it should
be accepted) in saying that the distributions are similar; we
believe that this is justified by observation and by the difficulties
created by large population sizes. The large number of data
points makes it possible to detect very small differences in
two distributions, differences that are not necessarily meaningful.
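
The remark about large population sizes can be illustrated with a short calculation. The sketch below is not the test the authors ran; it assumes a simple two-sample z-test for a difference in means with equal group sizes and a common standard deviation, and the numbers are invented purely to show how the same small difference becomes "significant" as the sample grows.

```python
import math


def two_sample_z_pvalue(mean_a, mean_b, sd, n):
    """Two-sided p-value for a difference in means.

    Assumes two groups of equal size n sharing a known standard
    deviation sd (a simplification chosen for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n)
    z = abs(mean_a - mean_b) / standard_error
    # For a standard normal variable, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# The same hypothetical difference of 0.05 units (sd = 1.0) at two sample sizes.
p_small_sample = two_sample_z_pvalue(2.50, 2.55, 1.0, 100)
p_large_sample = two_sample_z_pvalue(2.50, 2.55, 1.0, 25000)
```

With 100 compounds per database the difference is far from significant, whereas with 25 000 per database it falls well below the 0.005 level, even though a shift of 0.05 units would rarely matter when descriptor ranges are drawn.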

1.4. Partitioning of the Data to Create the DPD
Descriptor. The next stage of our analysis was to combine
the six descriptors into a single descriptor, the DPD code,
that is a linear combination of the components. Two
methods were considered: clustering and partitioning. Techniques
for clustering chemical objects have been well
reviewed by other researchers. For the purpose of clustering,
the object would be a point in the six-dimensional
descriptor space, and the Euclidean distance between objects
would be the dissimilarity of the objects, assuming equal
weighting between the descriptors. A simple composite code
would then be the cluster identifier after clustering of the
data. The number of clusters can be set arbitrarily. There
were two factors that weighed against the use of clustering
in this study. The first is that the application of a clustering
method makes the assumption that the data are in fact amenable
to clustering (in other words, most clustering methods will
produce a clustering whatever the data); to the authors' knowledge
there are no simple ways of testing if this assumption is
justified for a very large dataset. Certainly, cluster significance
tests have been proposed, but they are quite
computationally expensive and not practicable to apply to
very large datasets. The second and most important factor
is the lack of generality of the descriptor. If the descriptor
was defined by the clustering of one database, it is hard to
define the descriptors for compounds in a second database
without a large number of expensive distance calculations
and some arbitrary definitions of cluster dimensions.
Partitioning is best described as a boxing algorithm: each
descriptor is divided into ranges; a combination of descriptor
ranges makes a partition or box. The composite descriptor
is then effectively the coordinate vector of one of the vertices
of the box. At an even simpler level, the coordinate vector
can be made up of the (integer) names of the lower values
of defining ranges. For instance, ethyl benzoate, with
property values of number of H-acceptors = 1, number of
H-donors = 0, flexibility = 2.81, electrotopology = 149,
ClogP = 2.64, and aromatic density = 4.76, would be assigned
to partition 111222, using the descriptor ranges described
below in Table 3. The complete set of partitions is formed
by taking all the combinations of all the ranges into which
the molecular descriptors have been divided. It is completely
portable between different databases provided the same
descriptors and ranges are used. We freely acknowledge that
there are disadvantages to the partitioning algorithm, in the
arbitrary way in which the ranges must be set and the
introduction of edge effects when a partition boundary slices
between two very similar compounds; an answer to this issue
may come through the application of fuzzy logic. However,
we felt that the portability of the descriptor was key to its
usefulness. Other workers have also addressed the issue of
which method, clustering or partitioning, gives the better
performance; there is not yet a consensus on this subject.
The initial partitioning study was again performed using
only the filtered RPR_1 database. Molecular properties were
calculated for 42 700 structures. The structures for which a
value of ClogP could be computed at a reasonable error level
(no missing fragments, incorrect bonding) and which had a
formula weight between 150 and 565 daltons (24 828
compounds) were used in the statistical analysis. The
justification for the molecular weight limits was based on
the extremes of molecular weight found in small molecule
drugs (metronidazole and pristinamycin). Heavier and lighter
bioactive compounds can be found, but these are the
exception rather than the rule. The filtering rules are given
in Table 1a,b.
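
The boxing scheme described above can be sketched in a few lines. The range boundaries below are hypothetical: the actual divisions belong to Table 3, which is not reproduced here, so the edges were chosen only so that the ethyl benzoate values quoted in the text map to partition 111222. The names `HYPOTHETICAL_EDGES`, `DESCRIPTOR_ORDER`, and `dpd_code` are likewise illustrative.

```python
from bisect import bisect_right

# Hypothetical cut points per descriptor (NOT the published Table 3 values).
# A value below the first edge falls in range 1, and so on upward.
HYPOTHETICAL_EDGES = {
    "h_acceptors": [3],
    "h_donors": [2],
    "flexibility": [4.0, 6.0],
    "electrotopology": [100.0, 200.0],
    "clogp": [1.5, 3.5],
    "aromatic_density": [3.0, 6.0],
}

# Order of the digits in the composite code, following the text's example.
DESCRIPTOR_ORDER = [
    "h_acceptors",
    "h_donors",
    "flexibility",
    "electrotopology",
    "clogp",
    "aromatic_density",
]


def dpd_code(properties, edges=HYPOTHETICAL_EDGES):
    """Concatenate the 1-based range number of each descriptor value."""
    digits = []
    for name in DESCRIPTOR_ORDER:
        range_number = 1 + bisect_right(edges[name], properties[name])
        digits.append(str(range_number))
    return "".join(digits)


# Property values for ethyl benzoate as quoted in the text.
ethyl_benzoate = {
    "h_acceptors": 1,
    "h_donors": 0,
    "flexibility": 2.81,
    "electrotopology": 149,
    "clogp": 2.64,
    "aromatic_density": 4.76,
}
code = dpd_code(ethyl_benzoate)
```

Because the code depends only on the fixed edges and not on the other compounds present, the same function can be applied unchanged to any database, which is the portability argument made in the text. The edge effect is equally visible: two compounds on either side of a cut point receive different digits however close their values are.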

Table 3. Divisions Created for Each Molecular Descriptor

Frequency histograms for each descriptor were plotted and
used to select divisions for the partitions. The frequency
histograms show a reasonable approximation to normal
distributions, allowing for the integer nature of the number
of hydrogen-bond acceptors or donors. The dividing values
for the descriptors were set to split the histograms from
RPR_1 into regions with equal areas under the curve. A
histogram was produced for each property (Figure 2a-f).
There were two exceptions to this binomial pattern, and extra
divisions were set up to allow for them: compounds that
contain no aromatic rings, which were assigned to aromatic
density partition 0 (this has fully aliphatic compounds such
as steroids and long alkyl chain acids and bases), and
compounds containing a quaternary nitrogen, which were
assigned to ClogP partition 0. The reason for the latter
assignment is that the ClogP values for compounds containing
quaternary nitrogens are not at all reliable (missing
fragments). The divisions used (Table 3) give a total of 432
partitions for all combinations of property values, or 576
partitions if the ClogP class 0 (containing only quaternary
N+ compounds) is included. The number of partitions is
simply the product of the number of ranges for all the
descriptors and is arbitrary. Pragmatic screening considerations
dictated our choice of 432 partitions, but any number
of partitions could be chosen. An increase in the number
of ranges increases the resolution of the descriptor.

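The equal-area division and the partition count can be
sketched in a few lines of Python. This is only an illustration
under stated assumptions, not the MAKE_DPD code: the function
names, the simulated ClogP values, and the per-descriptor range
counts (3, 4, 3, 2, 2, 3) are hypothetical, chosen only so that
their product reproduces the totals of 432 and 576 partitions;
the actual divisions are those of Table 3.

```python
import bisect
import math
import random


def equal_area_cuts(values, n_ranges):
    """Return n_ranges - 1 cut points that split the sorted values
    into ranges holding approximately equal numbers of compounds."""
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // n_ranges] for k in range(1, n_ranges)]


def assign_range(value, cuts):
    """Map a descriptor value to a 1-based range index."""
    return bisect.bisect_right(cuts, value) + 1


def partition_count(ranges_per_descriptor):
    """Number of partitions = product of the number of ranges."""
    return math.prod(ranges_per_descriptor)


# Hypothetical, simulated descriptor values (not data from the paper).
random.seed(0)
toy_clogp = [random.gauss(2.5, 1.5) for _ in range(9000)]

cuts = equal_area_cuts(toy_clogp, 3)
counts = [0, 0, 0]
for v in toy_clogp:
    counts[assign_range(v, cuts) - 1] += 1

# Hypothetical range counts for six descriptors (ClogP listed first).
without_class_0 = partition_count([3, 4, 3, 2, 2, 3])
with_class_0 = partition_count([4, 4, 3, 2, 2, 3])  # extra ClogP class 0
```

Because the cut points are taken from the sorted values, each
range receives about one-third of the compounds whatever the
shape of the histogram, which is the sense in which the areas
under the curve are equal.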
Selection of General Screening Sets Using the DPD
Partitioning. A database of compounds can be classified
using the partitioning descriptors and ranges so that the
compounds fall into one of 432 or 576 partitions. The
partitioning of a database of 25 000 compounds (after
filtering) will put, on average, 50 (0.2%) compounds into
each partition. (The average is, of course, dependent on the
size of the original database. The graphs of the distribution
of compounds against percentage partition occupancy are
given in Figure 3a-e.) The DPD general screening set is
made up by selecting one member from a random selection
of eight molecules taken from each partition (with the proviso
that stocks of that compound are available!); if none of the
eight were suitable, a further tranche was taken. The
assistance of experienced medicinal chemists at this stage is
invaluable. Not all partitions are filled, as the structures that
would fall in these partitions may not be chemically or
pharmaceutically feasible, and for partitions with few
representatives, there may not be a compound with adequate
stock levels available. The number of unfilled boxes across
the full set of 576 partitions is given in Table 4. There are
40 boxes (7%) that are unfilled across all five databases:
16 of these are from the ClogP class 0 (which probably
reflects poor sampling), 9 have flexibility class 3 and ClogP
class 1, 11 have flexibility class 1 and ClogP class 3, and
the remaining 4 have extreme values of the aromatic density
descriptor. Because of the correlations between descriptors
and basic chemical intuition, one would not expect many
compounds to have a high flexibility index and a low value
of ClogP. The converse is not true (polyarenes, for
example), but these classes of compounds may not be well
represented in medicinal chemistry databases. It would also
be difficult to imagine a small organic molecule (<565
daltons) with high hydrogen-bonding power that was also
strongly hydrophobic. However, the issue of the missing
partitions cannot be dismissed so lightly. An arbitrary set
of compounds should be spread evenly over the whole of
property space as described by an orthogonal set of axes
(descriptors), assuming an adequate population size. If there
are correlations between descriptors (as there are in our case),
the distribution of points will be skewed toward the axis of
correlation. A thought-experiment can be constructed
considering a set of points plotted in XY space and lying
between (0,0) and (10,10), partitioned at intervals of 1 unit.
If the X- and Y-values are uncorrelated, then we would expect
that all partitions would be equally occupied, with mean and
standard deviation related to the density of points and the
number of partitions. If the values are positively correlated,
we would expect a higher occupancy for partitions near to
the line X = Y and much less near the regions around (0,10)
and (10,0). Interestingly, the empty partitions (which are
6D hypercubes) may well be demonstrating that there are
higher-order correlations between the descriptors. The fact
that there are more positive correlations than negative ones
may also mean that the assumption about orthogonality is
less secure.
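The thought-experiment is easy to simulate. The sketch below
is a hypothetical illustration rather than part of the original
study: the number of points, the bounded uniform noise used to
induce the correlation, and all function names are assumptions
made for the example.

```python
import random


def occupancy(points, n=10):
    """Count points per unit cell on an n x n grid over [0, n] x [0, n]."""
    grid = [[0] * n for _ in range(n)]
    for x, y in points:
        # min() folds a coordinate of exactly n into the last cell
        grid[min(int(x), n - 1)][min(int(y), n - 1)] += 1
    return grid


def empty_cells(grid):
    """Number of partitions that received no points."""
    return sum(cell == 0 for row in grid for cell in row)


def sample(n_points, correlated, rng):
    """Draw points in the square; optionally make y track x."""
    pts = []
    for _ in range(n_points):
        x = rng.uniform(0, 10)
        if correlated:
            # y follows x with bounded noise, clipped to the square
            y = min(10.0, max(0.0, x + rng.uniform(-2, 2)))
        else:
            y = rng.uniform(0, 10)
        pts.append((x, y))
    return pts


rng = random.Random(1)
uncorr = occupancy(sample(5000, False, rng))
corr = occupancy(sample(5000, True, rng))
```

With uncorrelated coordinates every one of the 100 cells is
occupied, whereas in the correlated case the cells around (0,10)
and (10,0) stay empty however many points are drawn, because
the noise can never carry a point that far from the line X = Y.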

The question then arises of whether the partitions could
be theoretically adjusted to minimize the number of empty
partitions. There may be a way of relating the correlation
with the probability of occupancy, but that is beyond the
scope of this paper. The number of void partitions may be
reduced by having an irregular partitioning (small intervals
in regions of high data population, large elsewhere); that is
what we tried to achieve qualitatively with our equal-area
approach. The final DPD set was initially created by
selecting one compound from each partition (with an
available compound) from each of the three segments of the
database (RPR_1, RPR_2, RPR_3). Because of the historical
differences in the types of compounds in these three
segments, it was believed that this would give an added
dimension of diversity in the final set. So by selecting three
representatives where possible from each partition, the goal
of a diverse "representative" set of about 1000 compounds
was achieved.

It should be clear from the discussion above, that a DPD 
set will not contain a fixed number of compounds. Some 
partitions will be empty, and some will contain only 
compounds that are out of stock. Conversely, partitions that 
contain many compounds ought to be represented by more 
than one compound in the DPD set.

The filtered RPR_1 database was used to test the usefulness
of the partitioning paradigm. In this study, the ClogP
class 0 was not included. It was found that 404 out of 432
partitions contained compounds suitable for screening. Of
the 404 partitions for which a structure was initially
identified, a further 61 partitions could not be represented
in the final set, as either all the compounds were rejected by
the chemists or no in-stock compounds could be found,
giving a new filled partition number of 343.
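The selection procedure (a random tranche of eight compounds
per partition, a stock check, and review by a chemist) can be
sketched as follows. This is a minimal illustration, not the
procedure as implemented at RPR: the data structures, the
`acceptable` callback standing in for the chemist's judgement,
and the toy compound identifiers are all hypothetical.

```python
import random


def select_screening_set(partitions, in_stock, acceptable, tranche=8, rng=None):
    """Pick at most one compound per partition.

    partitions: dict mapping DPD code -> list of compound ids
    in_stock:   set of compound ids with adequate stock
    acceptable: callable standing in for the chemist's review
    """
    rng = rng or random.Random()
    chosen = {}
    for code, members in partitions.items():
        pool = list(members)
        rng.shuffle(pool)
        # Examine the shuffled pool one tranche at a time.
        for start in range(0, len(pool), tranche):
            batch = pool[start:start + tranche]
            ok = [c for c in batch if c in in_stock and acceptable(c)]
            if ok:
                chosen[code] = ok[0]
                break
    return chosen


# Toy data: one partition with a usable compound, one whose only
# member is out of stock, and one that is empty.
partitions = {"222213": ["A1", "A2", "A3"], "213233": ["B1"], "111111": []}
in_stock = {"A2", "A3"}
chosen = select_screening_set(
    partitions, in_stock, acceptable=lambda c: c != "A3", rng=random.Random(0)
)
```

Partitions that are empty, or whose members are all out of
stock or rejected, simply drop out of the result, which is why
the number of filled partitions falls at each stage.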

1.6. Profiling of Combinatorial Libraries Using the
DPD Code. The DPD method of partitioning can also be
used to profile a proposed combinatorial library. The
normalized distribution of compounds across the DPD
partitions is reasonably similar for the five compound
databases examined (see Results section). It is relatively
simple to construct a database of SMILES codes for the
proposed library (we have developed an in-house C program
to do this), to compute the molecular descriptors and the DPD
codes, and then to partition this virtual database using the
methods described above. A reasonable match of the DPD
profile of the proposed library against the general DPD
profile can then be made one of the design criteria that the
library seeks to meet if it is intended for general screening;
for focused and biased libraries, the DPD profile can provide
warning or confirmation of deviation from the profile of the
reference libraries. We would hesitate to recommend that
the DPD profile be the sole criterion for design, as we believe
that pharmacophore descriptors are more important. We
do think that a library whose compounds fall mainly in
partitions that are very sparsely filled by the compound
databases in this study should be looked at very carefully
before synthesis because it has an unusual profile. It is not
necessary for the library to have a very close similarity in
terms of its DPD partitioning compared to the reference
databases in this study to be acceptable, providing it falls
within the well populated partitions. Combinatorial libraries
can be considerably smaller than the reference databases,
raising serious objections about inadequate sample size (576
members would be required to even have a chance of
occupying all the filled partitions in our study). The largest
common substructure present in all library members may
have a biasing effect, as we have observed in analogue series
within the reference databases. The profile should be used
to highlight possible problems in the design, rather than force
conformity.
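One way to flag a library with an unusual profile is to
measure the fraction of its members that fall in partitions
sparsely filled by a reference database. The sketch below is
an assumption-laden illustration: the function name, the 10%
threshold, and the single-letter partition codes are
hypothetical and are not taken from the in-house programs.

```python
from collections import Counter


def sparse_fraction(library_codes, reference_codes, threshold_percent):
    """Fraction of library members landing in partitions whose
    reference occupancy is below threshold_percent (empty reference
    partitions count as sparse)."""
    ref = Counter(reference_codes)
    n_ref = len(reference_codes)
    sparse = sum(
        1
        for code in library_codes
        if 100.0 * ref.get(code, 0) / n_ref < threshold_percent
    )
    return sparse / len(library_codes)


# Toy reference: partition A holds 50%, B 45%, C 5%, D is empty.
reference = ["A"] * 50 + ["B"] * 45 + ["C"] * 5
library = ["A", "A", "C", "D"]
flagged = sparse_fraction(library, reference, threshold_percent=10.0)
```

Here half of the toy library falls in sparse or empty
reference partitions, which would prompt a closer look at the
design rather than its automatic rejection.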

1.7. Finding Related Compounds Using the DPD Code.
Another beneficial effect of the use of partitioning to set up
the DPD code is that it is very simple and quick to find all
compounds in a database that match a query code. For all
but the largest of databases, a straightforward linear search
will suffice, but the database could be keyed on the DPD
code to make search times faster. As we often search our
databases on several different keys, file inversion is not
worthwhile. A database of 150K structures can be searched
in a few seconds on an SGI Indigo2 R4400. By contrast, a
cluster-based code would require the use of a more expensive
nearest-neighbor algorithm. The partition code can
potentially miss related compounds due to edge effects; in our
searching program, we allow the option of widening the
search criteria to include neighboring ranges. This allows
the user to make decisions about what factors might be most
important in designing a follow-up set of compounds after
an initial screening hit. If ClogP, for instance, was decided
not to be a crucial factor, then the search could be set to
include all the ClogP ranges; similarly, the search can be
broadened to include less and/or more flexible compounds.
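A linear search with optional widening can be sketched as
below. This is a hypothetical illustration, not the searching
program itself: the representation of a DPD code as a tuple of
six range indices, the position assigned to each descriptor,
and the function and parameter names are assumptions.

```python
def matches(code, query, widen=None, ignore=()):
    """Compare two DPD codes given as tuples of six range indices.

    widen:  dict mapping position -> tolerance in ranges (+/-)
    ignore: positions accepted whatever their range
    """
    widen = widen or {}
    for pos, (c, q) in enumerate(zip(code, query)):
        if pos in ignore:
            continue
        if abs(c - q) > widen.get(pos, 0):
            return False
    return True


def search(database, query, widen=None, ignore=()):
    """Straightforward linear scan over (compound_id, code) pairs."""
    return [cid for cid, code in database if matches(code, query, widen, ignore)]


# Toy database; position 5 is taken here to be the ClogP range.
db = [
    ("c1", (2, 2, 2, 2, 1, 3)),
    ("c2", (2, 2, 2, 2, 1, 2)),
    ("c3", (2, 2, 2, 2, 1, 1)),
    ("c4", (1, 2, 2, 2, 1, 3)),
    ("c5", (3, 3, 3, 3, 3, 3)),
]
query = (2, 2, 2, 2, 1, 3)

exact = search(db, query)
neighbours = search(db, query, widen={5: 1})
any_clogp = search(db, query, ignore=(5,))
```

Leaving the choice of `widen` and `ignore` to the caller keeps
the assumptions about which descriptors matter for a given hit
explicit, at the cost of a larger follow-up set.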

2. RESULTS

The MAKE_DPD program was used to partition structures
from the RPR_1, RPR_2, and RPR_3 databases. Two
external databases of compounds that show biological
activity, the Pharma Projects (PP) database and the Standard
Drug File (SDF), were also classified as a reference set.
The standard sets of default filters and partition values were
applied. The original work for this study was performed on
the local VAX cluster. The running times given (Table 4)
should therefore be taken only as a guide, as the machines
in the cluster have different specifications. The SGI IRIX
version of MAKE_DPD benefits from the faster chip sets
of the SGI computers and runs 30-50 times faster. The
numbers of compounds that were left after each stage of
filtering are also given. The observation that there are
relatively large reductions in database size after each level
of filtering indicates that some care must be exercised in
choosing a random screening set of compounds from a raw
database, especially when the database contains compounds
intended for agrochemical and pharmaceutical screens.
Although the current implementation has been designed for
use in a pharmaceutical context, it is simple to change the
control files to apply a different set of filters more appropriate
to other applications.

2.1. Comparison of the Descriptor Distributions and
Correlations from Different Databases. A statistical
analysis was performed on each database, to check
correlations of descriptors, the distributions of descriptor
values, and the profile of partitions. The profile of the
partitions examines the relative numbers of compounds in each
partition for all the databases. It has been used to check
whether the method gives similar classifications on
independent databases. Despite their very different origins, the
distributions of properties in all databases are broadly similar:
the histograms organized by database are given in Figure
2a-e, and comparative plots organized by category are given
in Figure 2f. The observation of the trend leads to the belief
that the DPD method is generally applicable.

Table 5. Correlation Matrix for the Partition Profiles of the
Various Databases

2.2. Comparison of the Partitioning of Different
Databases. The profiles of different databases were
compared using the percentage normalized occupancy of each
partition (100 * number of compounds in partition/total
number of compounds). To make comparison more simple to
visualize, a reference database, made by combining the
results of the RPR_1, RPR_2, and RPR_3 databases, is used
as a reference in each plot and the data sorted into ascending
order of occupancy in the reference database (Figure 3a-e).
It is clear that the profiles are broadly similar, again
confirming the idea that the DPD method can be applied to
any diverse chemical database. The local discrepancies in
the overall trend can often be traced back to individual project
families within a database. For example, in column 475 of
the RPR_3 plot (Figure 3c), corresponding to bin 222 213,
the RPR_3 database has a much higher occupancy than the
reference. This corresponds to a family of compounds made
for an antihypertension project, which make up 75% of the
members of the bin. In column 477 of the sorted plot,
corresponding to bin 213 233, families of compounds made
for a leukotriene D4 project (43%) and a leukotriene B4
project <H'Z i. account for the difference. The PP database,
which should not contain large analogue families, again
shows that the trend of the profile is similar (Figure 3d);
such a database is of course only representative of biological
targets already exploited and will tend to contain series of
related "me-too" compounds. The extra variation may be
due to the relatively small size of the PP database, which
would tend to highlight the absence or presence of
compounds in certain partitions. However, a very similar pattern
of variation is seen in the SDF database (Figure 3e), which
would tend to reinforce the arguments concerning the
presence of analogue series within the data.

Although the graphical profiles are useful for looking at
general trends and identifying specific differences, it is not
easy to quantify the overall similarity of two databases in
this way. A correlation matrix of the number of compounds
in each partition is given in the upper half of Table 5. The
smallest coefficient is 0.65, and the average value is 0.76,
which is encouraging.
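The occupancy profile and one entry of such a correlation
matrix can be computed as in the sketch below. The text does
not state which correlation coefficient was used; the Pearson
coefficient, the function names, and the four-partition toy
databases are assumptions made for the illustration.

```python
import math
from collections import Counter


def percent_occupancy(codes, all_partitions):
    """100 * (compounds in partition) / (total compounds), per partition."""
    counts = Counter(codes)
    total = len(codes)
    return [100.0 * counts.get(p, 0) / total for p in all_partitions]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


parts = ["A", "B", "C", "D"]
db1 = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1
db2 = ["A"] * 8 + ["B"] * 6 + ["C"] * 4 + ["D"] * 2   # same shape, twice the size
db3 = ["A"] * 1 + ["B"] * 2 + ["C"] * 3 + ["D"] * 4   # reversed shape

p1 = percent_occupancy(db1, parts)
p2 = percent_occupancy(db2, parts)
p3 = percent_occupancy(db3, parts)
```

Normalizing to percentages removes the effect of database
size, so two databases with the same relative distribution
give a coefficient of 1 even when one is twice as large.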

2.3. Screening Results. The DPD rational set has been
used in biological screens at RPR since the beginning of
' " has been routinely used for screens at the Dagenham
Research Centre, where a number of weak leads have been
identified in enzyme-based assays and in whole-cell assays.
The activities are in the range 1-50 µM. Only the Low
Density Lipoprotein (LDL) series will be discussed.

The goal from screening was to identify compounds that
reduce blood LDL concentrations. The process of LDL
production is thought to be controlled in the cell nucleus,
and as the assay was cell-based, transport of the compounds
to the potential site of action was an issue. This motivated
us to screen the DPD representative screening set, as it was
selected on a physicochemical basis. Screening gave one
hit, a diaminopyrimidine compound; multiple hits would
indicate that some of the descriptors used to partition the
database (excluding flexibility) were perhaps not appropriate
for the assay. A follow-up screening set of compounds with
the same DPD code, to test the hypothesis that the DPD code
was relevant to activity, gave further hits which were
analogues of the diaminopyrimidine (1; illustrated in Figure
4). Low activity (IC50 1.7 µM) compounds based on a
dibenzamide structure (2) had been identified,1 but
optimization to more potent compounds was elusive. In parallel to
this work, a three-point 3D pharmacophore model had been
derived, and 3D searches of the corporate databases using
this query in ChemDBS-3D had produced screening sets that
yielded two new lead series, including one of compounds
related to the DPD hit. The 3D database query was then
further refined using the common features of the different
active compounds. From the diaminopyrimidine series,
compounds already existing in our corporate registry were
identified with activities of ca. 6 nM (compounds 3 and 4 in
Figure 4). This illustrates the point that DPD codes only
give initial hits that have physicochemical properties suitable
for binding. The advances in activity came only through a
3D pharmacophore model. The final stages of optimization
would involve fine-tuning and the production of close
analogues, so that a substructure-based similarity measure
would be of most use at this stage. DPD representative sets
do not give high quality leads, they give hits. However, the
hits can be obtained rapidly, and because the sets are
representative of the diversity of molecular properties present,
rather than of chemical series, can be particularly relevant
for new screens (receptor, enzyme, protein-protein, or whole
cell). The information used in the design of the DPD code
can also be used to guide further screening and modeling
studies.
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3. DISCUSSION 
The initial goal of this study was to develop a methodology
for rationally selecting subsets of corporate databases, to
enhance the efficiency of lead generation over random
screening. Any screening strategy should find hit compounds
quickly, and furthermore should enable those hit compounds
to be turned into a lead series. Design of a general screening
set will allow a greater diversity of the whole database to
be sampled more rapidly and, given the underlying design
assumptions, will allow rational approaches to be adopted
for turning any hits into a lead series. In particular, similar
compounds to the hit can be rapidly assembled to give a
secondary screening set that helps establish the validity of
the first hit. This is not to say that a general screening set
will always give better results than screening at random,
but the results can be analyzed in a clear frame of inference,
that is, the initial assumptions that went into the design of
the set. In the same way, the design of combinatorial
libraries can be enhanced by requiring that the library be
(dis)similar to a reference set of compounds. For these
reasons, we have studied several methods for computing
molecular similarity.

There have been many studies on molecular similarity, as
recent reviews bear out. Our line of reasoning has been
driven both by theoretical and pragmatic considerations. A
commonly used similarity metric is based on the functional
groups or paths contained within a molecule. Although this
method is very powerful when splitting a database up into
chemical families, a receptor or enzyme does not recognize
particular functional groups or paths per se; rather, it interacts
with the properties in space projected by these atoms. A
more satisfying metric might be one based on matches of
molecular properties, for instance, a Carbo index. We ruled
this out on purely practical grounds, as the index, even using
Gaussian methods, is slow to compute and is dependent
on conformation. In addition, there is the complication of
how many maxima in the similarity matches between just
two molecules should be carried through. Although we are
convinced of the usefulness of these methods in some
contexts, they did not seem to be appropriate for our
requirements here. Another method would be to use
pharmacophoric similarity: that is the subject of another
paper in this series,7 and at the time this study was started
was not a practical proposition for large databases of
conformationally flexible molecules. Given the perceived
limitations in these other methods, we decided to base our
similarity descriptors on overall molecular properties; other
groups have also taken a parallel approach but with very
different criteria for selecting descriptors. The COUSIN
program, developed independently by workers at Upjohn,
has grown along very similar lines but with the emphasis
on using an experimental design paradigm for selection of
the diverse set.

A ligand with high affinity for a receptor site will have a
very high degree of complementarity, that is, the molecule
will fit snugly into the site and will match the spatial
hydrogen-bonding and electrostatic profile of the site. In
contrast, one would expect that an initial screening hit may
only have micromolar affinity. This implies a qualitative
match to the site, that is, the lead is of the right overall
polarity/hydrophobicity to get to the site and contains a
fragment that can fit into the target site. The issue of
transport to the site is important in whole-cell assays.
Molecular properties are a reasonable compromise; they
represent the ensemble average of the individual
conformations of a molecule, they contain some notions of the
properties important in ligand-receptor interactions, and they
are very quick to compute. The hydrophobicity and polarity
measures are properties of the whole molecule and will
reflect the general environment of the receptor site. The
hydrogen-bond and steric descriptors will probe the specific
nature of the site. Of course, they are not without their
limitations, in particular the gross averaging of conformation
space and the absence of dipole descriptors (to avoid
difficulties with coordinate frames of reference). However,
we feel that for performing inter- and intralibrary
comparisons to obtain rational screening sets or to aid the design
of combinatorial libraries, the use of molecular properties as a
similarity metric is justifiable.

Figure 5. A schematic graph that plots biological activity against
2D similarity to a reference molecule. The correlation of activity
with 2D similarity to the most active molecule is often observed.

It has been argued that any molecular diversity measure
should be validated by testing to see how effective it is at
separating active from inactive molecules in a biological
assay, and studies to this effect have been performed. We
do not believe that this prescription will give a proper
validation of a similarity metric for two reasons: first, the
difficulty in setting an appropriate activity cut-off and
second, the absence of any truly representative data sets. If
one were to conduct a retrospective analysis of a typical
medicinal chemistry study and plot a graph of the biological
activity of the molecules in the study against their 2D
substructure similarity to a reference (highly active)
molecule, one would expect to see a graph similar to that shown
in Figure 5. The substructure similarity metric is the one
most keenly perceived by organic chemists. The trend of
increasing activity with 2D similarity is often observed in
medicinal chemistry studies, but it does not mean anything
in terms of ligand-receptor interactions. The trend is mainly
the understandable tendency of chemists to reinforce success
and to make analogues of lead compounds. The better the
activity, the smaller the changes as fine-tuning occurs, which
is why the 2D measures discriminate well. If the activity
cut-off is set at 10 " M, one would expect 2D similarity
measures to perform best at separating actives from inactives.
Our metric is targeted at much lower levels of activity ( JfJ" 1 -
,'m 1 M); a similarity metric for the most interesting range
( -10 > ) is covered in another paper in this series.
Perhaps it would be more appropriate to determine for a
particular similarity measure the cut-off level that gave most
discrimination. The other difficulty is the absence of data
sets that contain consistent biological data against a range
of targets for a large number of compounds (not just
analogues). Provision of such a data set is a major challenge.

The data subjected to statistical analysis were prefiltered
to remove undesirable structures: about 2W of the RPR_1
database was rejected. This will obviously bias the statistics.
However, we feel that this procedure is justified, as we were
trying to look at profiles of databases of potential
pharmaceutical drug molecules (leaving aside anticancer and
antifungal drugs, where some degree of eukaryotic cytotoxicity/
reactivity is necessary for action, and the eicosanoids that
would be handled separately). Our target is the formation
of a reversible, specific complex between a small molecule
ligand and a protein receptor. Filtering is justified for this
frame of reference.
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3.1. Clustering and Partitioning. Partitioning is not the 
only algorithm that can be used to divide up a database into 
classes. The data can also be clustered into families, on the 
basis of a similarity function. The similarity function 
measures the distance between two objects, and each cluster 
is made up of objects that are close to each other. The 
advantage of clustering is that it does not suffer from edge 
effects at partition boundaries whereas partitioning is more 
portable between independent data sets. We have experience
in clustering databases using both hierarchical and parametric 
methods. The disadvantages of clustering are that the 
programs take a very long time to run for large data sets, 
and the criteria for family inclusion/exclusion are somewhat 
arbitrary. In addition, a similarity function that encompassed 
all six dimensions would need to have relative weights for 
the dimensions. An appropriate similarity metric could be 
obtained through a principal components analysis, but at the 
possible cost of losing the key dimensions we postulated from 
our understanding of ligand-receptor interactions. Partition- 
ing seemed to be more appropriate for large chemical 
databases; this assumption is supported by the similar 
partition profiles derived from several independent databases.
Other workers have favored clustering. 11 

3.2. The Rational Selection of General Screening Sets. 
The first application of the DPD method was in the rational 
selection of small general screening sets. It can be argued 
with justification that a set consisting of only 1% of the
complete database cannot be truly representative. On the
other hand, it is very difficult, if not impossible, to say what 
fraction would give a proper sampling. The number of
partitions that we set up was dictated by the then needs of
our screening unit and has no theoretical basis. The DPD
method has been designed so that any number of partitions 
could be created, with the proviso that consistent partitioning 
values need to be used if the DPD code is to remain 
transferable between databases. When analyzing the results 
of assaying a screening set, the underlying assumptions 
behind the partitioning strategy should be taken into con- 
sideration. In an ideal case, each partition would have a 
distinct and unique molecular property profile, and only one 
profile would match the profile of the site. This means that 
only one lead should be found when a DPD set is scanned. 
In real life, the target profile may match more than one 
partition, because a particular descriptor is not relevant to 
binding; this should be detectable. In addition, the DPD 
descriptors are not truly orthogonal, so there will be 
correlation effects. If several hits are found, all apparently 
unrelated in their DPD code, the assumptions on which the 
set was selected are probably not valid for the assay.

3.3. Design of Combinatorial Libraries. The DPD
method can be used to generate a profile for a proposed
combinatorial library. The profile of the library can be
compared to an arbitrary reference database (for example, the
SDF database or a corporate registry) using the bin
occupancy measures. In addition, the ClogP and molecular
weight descriptors contained within the DPD databases can
be plotted as separate frequency histograms and compared
to distributions from reference databases. We do not of
course advocate that a design for a combinatorial library be
rejected because the profiles do not match the reference
standards exactly. The DPD profile is however an indication
that the design may need to be closely examined to see what
is causing the observed differences, although designs focused
or biased to a particular target may well have, or indeed need,
a different occupancy. Conversely, if one were so minded,
the DPD method can easily be inverted to favor libraries
that are the complement of a given reference database. The
comparison of libraries has been automated into a single
C-shell script, and the profiling results can be obtained in a
few minutes and can be used by medicinal chemists routinely.
In our hands, the DPD profile is used at an early stage in
the design, to identify combinatorial products with more
extreme physicochemical properties. These products can be
examined in the training phase to give an interpolative picture
of how the other compounds in the initial design will behave.
For focussed or optimization libraries, where the reference
library may consist of only a few ligands, the DPD method
is not particularly useful; other factors like pharmacophore
descriptors should be given more weight.

3.4. Validity of Partitioning for Selecting Secondary 
Screening Sets. The DPD method will quickly find 
compounds that are related physicochemically, but which 
may have little substructure similarity. The lead compound 
is examined using the same procedure as before to determine 
its DPD code. Once the code has been determined, then 
other compounds in the same class (and which therefore have 
similar molecular properties) can be sent for secondary 
screening, to try to find other potential lead compounds from 
different chemical families. For a database of 40 000
compounds, we have found on average about 50 follow-up
compounds. These molecules should have a much higher
probability of showing activity, if the assumptions concerning
the importance of the DPD code are valid for the assay
concerned, with the added bonus of possibly identifying a
new chemical series of leads. This last point is important
as a chemically diverse series of leads will assist the
production of a pharmacophore model and will provide a
choice of synthetic targets to follow up.

However, there is a general weakness inherent in all
classification procedures, regarding objects that lie on the
periphery of a family. The classification rules will force an
object into one family when it may go just as well into
another one. In partitioning, objects that lie close to the
dividing boundaries may be misclassified. This has
implications for follow-up screening of molecules in the same
partition as a lead. To avoid missing compounds that just
fall outside a partition, the surrounding partitions should also
be tested. However, for a 6-dimensional system, with
perhaps 50 compounds per partition, this could provide a
further potential 36 000 compounds to be screened (3^6 = 729
adjacent partitions, the lead's own included, at about 50
compounds each), which defeats the object of the exercise.
For properties such as
flexibility it would be reasonable to search routinely for 
related compounds which differ only in this parameter, 
particularly for less flexible compounds when the hit is a 
flexible compound. For broader searches, a nearest-neighbor 
list could be constructed using the absolute values of the 
descriptors. We have decided against using this approach, 
as it requires us to construct a weighting scheme between 
the descriptors (for the Cartesian calculation of distance),
and the scheme may not always be appropriate for every 
situation. We prefer to allow the medicinal chemist to make 
explicit assumptions as to the SAR of the system when 
selecting which descriptor ranges to broaden. However, we 
do not dismiss the other approach, and each case should be 
decided on its merits. The DPD method groups together 
compounds with similar properties, so even though the 
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ft is a very challenging^usk. ^mic^^scwmng sets 

|^AR of a ^icular 
combinatorial libraries. The general applicability of the
method has been validated by comparing results from four
independent compound libraries. General screening sets
derived using the DPD method are in use within
Rhône-Poulenc Rorer, and have provided useful hits. The DPD
method for measuring molecular similarity offers new
capabilities for comparing and profiling libraries and
compounds and thus for lead generation and exploitation.
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